Minimax Games with Bandits
Authors
Abstract
One of the earliest online learning games, now commonly known as the hedge setting [Freund and Schapire, 1997], goes as follows. On round t, a Learner chooses a distribution w_t over a set of n actions, an Adversary reveals ℓ_t ∈ [0, 1]^n, a vector of losses for the actions, and the Learner suffers w_t · ℓ_t = ∑_{i=1}^{n} w_{t,i} ℓ_{t,i}. Freund and Schapire [1997] showed that a very simple strategy of exponentially weighting the actions according to their cumulative losses provides a near-optimal guarantee. That is, by setting ...
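The abstract is truncated above. As a rough illustration of the exponential-weighting (Hedge) strategy it refers to, here is a minimal sketch assuming per-round losses in [0, 1] and a fixed learning rate eta; the function name, the use of numpy, and the parameter eta are illustrative choices, not the exact tuning from the paper (a common choice is eta on the order of √(ln n / T)).

```python
import numpy as np

def hedge(losses, eta):
    """Exponential weights (Hedge) over n actions.

    losses: T x n array of per-round loss vectors in [0, 1].
    eta: learning rate (assumed fixed here for simplicity).
    Returns the total loss suffered by the learner.
    """
    T, n = losses.shape
    cum_loss = np.zeros(n)            # cumulative loss of each action
    total = 0.0
    for t in range(T):
        # weights proportional to exp(-eta * cumulative loss)
        w = np.exp(-eta * cum_loss)
        w /= w.sum()
        total += w @ losses[t]        # learner suffers w_t . l_t
        cum_loss += losses[t]
    return total
```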
Similar papers
Non-trivial two-armed partial-monitoring games are bandits
We consider online learning in partial-monitoring games against an oblivious adversary. We show that when the number of actions available to the learner is two and the game is non-trivial, then it is reducible to a bandit-like game and thus the minimax regret is Θ(√T).
Minimax Policies for Bandits Games
This work deals with four classical prediction games, namely full information, bandit and label efficient (full information or bandit) games as well as three different notions of regret: pseudo-regret, expected regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified analysis...
Batched Bandit Problems
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy that operates under this constraint and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optima...
Provably Optimal Algorithms for Generalized Linear Contextual Bandits
Contextual bandits are widely used in Internet services from news recommendation to advertising and Web search. Generalized linear models (logistic regression in particular) have demonstrated stronger performance than linear models in many applications where rewards are binary. However, most theoretical analyses on contextual bandits so far are on linear bandits. In this work, we propose ...
Regret Bounds and Minimax Policies under Partial Monitoring
This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a u...